Modeling Melting Points Using Open Babel
Researchers: Jean-Claude Bradley, Andrew Lang, Tony Williams, et al.

[This page is a duplicate (backup) of the original model page on the ONSChallenge wiki]
Introduction
Fragment-based models are, in general, less accurate than more sophisticated nonlinear models such as random forests. However, because of their speed, fragment-based models are useful for approximating properties across large compound libraries. They are also smaller, easier to deploy, and offer better chemical interpretability than nonlinear models. Inspired by a blog post by OpenBabel developer Noel O'Boyle on how to deploy SMARTS fragment-based models using OpenBabel, we developed an open (CC0) fragment-based melting point model that can be incorporated into OpenBabel. When used to predict the melting points of a 3500-compound test set, the model achieved an R2: 0.61, AAE: 43.6, and RMSE: 58.1. While these results are comparable to a previously published, non-reproducible fragment-based model built on the problematic PHYSPROP database (659-compound test set; R2: 0.62, AAE: 48.9, RMSE: 48.8) [1], they pale in comparison to our current nonlinear models, which have an OOB R2: 0.83 and RMSE: 38.1.

Procedure
Lacking a dedicated open fragment calculator for QSPR modeling, we decided to use OpenBabel itself as the modeling platform. We began by collecting a set of fragments as the union of the fragments from the current fragment-based models deployed in Open Babel (psa, logp, mr). We then used OpenBabel to calculate the fragment counts for the same 16015-compound training set used in MPModel007: we first assigned a value of 1 to all descriptors to get a total count, and then, for each descriptor in turn, set its value to zero (with all the rest at one). Subtracting the two totals gives the count for that descriptor, and repeating this 167 times gives the full set of fragment counts. We had to compute the counts this way because in OpenBabel "Where several SMARTS strings assign values to the same atom, only the final assignment is retained." [2] (training set with fragment counts)

We removed the surprisingly large number of fragments with counts less than 5 ([#1], [#50], [O][#14], [O][#33], [O][#50], [O]C=[#7], [O]C=S, [*], [#9-*], [#17-*], [#35-*], [#53-*], [#53+*], [Fe,Cu,Zn,Tc,Cd,Pt,Au,Hg], [CH4], [CH2](C)C, [CH](C)(C)C, [CH3][O,N,F,Cl,Br,#15,#16,#53;!a], [CH2X4][O,N,F,Cl,Br,#15,#16,#53;!a], [CHX4][O,N,F,Cl,Br,#15,#16,#53;!a], [CH3]c, [CH3][a!#6], [CH2X4]a, [CHX4]a, [CX4][!#6;!#7;!#8;!#9;!#15;!#16;!#17;!#35;!#53;!#1], [N]#*, [N](=*)#*, [NH]1-*-*1, [N+](-*)(-*)(-*)-*, [N+](-*)(-*)=*, [N+](-*)#*, [NH+](-*)(-*)-*, [NH+](-*)=*, [NH2+](-*)-*, [NH2+]=*, [NH3+]-*, [n](:*):*, [n](:*)(:*):*, [n](-*)(:*):*, [n](=*)(:*):*, [nH](:*):*, [n+](:*)(:*):*, [n+](-*)(:*):*, [nH+](:*):*, [N+0](=a)a, [N+0](a)(a)a, [NH3+*], [NH2+*], [NH+*], [n+*], [NH0+*](=A)(A)a, [NH0+*](=[#6])=[#7], [N+*]#A, [N-*], [N+*](=[N-*])=N, [#8], [O](-*)-*, [O]1-*-*1, [OH]-*, [O-]-*, [o](:*):*, [OH2], [O]=[#8], [OX1-*][#7], [OX1-*][#16], [OX1-*][#15;#33;#43;#53], [O]=[CH2], [O]=[CX2]=O, [O]=C([a!#6])[a!#6], [O-1]C(=O), [#15], [P](-*)=*, [S](-*)-*, [S]=*, [S](-*)(-*)=*, [S](-*)(-*)(=*)=*, [SH]-*, [s](:*):*, [s](=*)(:*):*, [S-*], [S+*]). Note: many of these fragments may have individual counts greater than five, but they end up with overall counts less than five because of the way OpenBabel assigns counts when multiple fragments match the same atom.
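The count-by-difference bookkeeping described above can be sketched in a few lines of base R. This is a toy illustration only, not the actual pipeline: the winners vector is invented data standing in for OpenBabel's final-assignment-retained behavior, and the real calculation loops over 167 descriptors and 16015 compounds.

```r
## Toy sketch of the count-by-difference trick (invented data, not the
## real training set). Each atom is represented by the index of the
## descriptor whose SMARTS assignment "wins" it, mimicking OpenBabel's
## rule that only the final assignment per atom is retained.
winners <- c(1, 1, 2, 3, 3, 3)   # descriptor that owns each atom
n_desc  <- 3

## Total group-contribution value for a given vector of descriptor weights
total_value <- function(w) sum(w[winners])

## For each descriptor: all weights 1, then zero out that one descriptor;
## the drop in the total equals the number of atoms assigned to it.
counts <- sapply(seq_len(n_desc), function(i) {
  w <- rep(1, n_desc)
  all_on <- total_value(w)
  w[i] <- 0
  all_on - total_value(w)
})
counts   # 2 1 3
```

The same differencing recovers per-fragment counts even though a single run of OpenBabel only ever reports the blended total.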
We then removed four correlated descriptors, [C]=c, [O]=[#7], [O]=c, and [PH](-*)(-*)=*, identified using the caret package in R.
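The original caret call is not reproduced on this page. As a stand-in, the greedy idea behind caret::findCorrelation can be sketched in base R; the 0.9 cutoff and the toy matrix below are assumptions for illustration, not the values used on the actual descriptor set.

```r
## Base-R sketch of greedy correlated-column removal, in the spirit of
## caret::findCorrelation (cutoff and data are invented for illustration).
set.seed(1)
x <- matrix(rnorm(120), ncol = 4)
x[, 4] <- x[, 1] + rnorm(30, sd = 0.01)   # make column 4 duplicate column 1

drop_correlated <- function(x, cutoff = 0.9) {
  cm <- abs(cor(x))
  drop <- integer(0)
  for (i in seq_len(ncol(x) - 1)) {
    for (j in (i + 1):ncol(x)) {
      if (!(i %in% drop) && !(j %in% drop) && cm[i, j] > cutoff) {
        drop <- c(drop, j)   # keep the earlier column, drop the later one
      }
    }
  }
  drop
}

drop_correlated(x)   # flags column 4 as redundant with column 1
```

Dropping one member of each highly correlated pair this way keeps the design matrix well conditioned before fitting the linear model.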
Again using R, we developed a simple linear model and performed further feature selection to obtain the fragment coefficients:
## load in data
mydata = read.csv(file="TrainingWithDescriptorsReadyForLinearRegression.csv", header=TRUE, row.names="molID")
fit <- lm(mpC ~ 0 + tf1 + tf2 + tf4 + tf5 + tf6 + tf7 + tf9 + tf10 + tf11 + tf12 + tf14 + tf17 + tf18 + tf20 + tf22 + tf23 + tf25 + tf26 + tf27 + tf28 + tf34 + tf36 + tf40 + tf44 + tf45 + tf46 + tf47 + tf48 + tf49 + tf50 + tf55 + tf56 + tf57 + tf58 + tf59 + tf60 + tf61 + tf62 + tf63 + tf64 + tf65 + tf66 + tf67 + tf68 + tf69 + tf70 + tf71 + tf72 + tf73 + tf76 + tf77 + tf78 + tf79 + tf81 + tf83 + tf84 + tf86 + tf87 + tf104 + tf105 + tf106 + tf109 + tf113 + tf115 + tf124 + tf128 + tf129 + tf131 + tf132 + tf133 + tf134 + tf135 + tf142 + tf143 + tf144 + tf147 + tf148 + tf149 + tf153 + tf155 + tf164 + tf167, data=mydata)
## step-wise feature selection to remove non-significant features
library(MASS)  ## stepAIC is provided by the MASS package
bestfit <- stepAIC(fit, upper=~., lower=~1, direction="both")
## output results
summary(bestfit)

[output]
Call:
lm(formula = mpC ~ tf2 + tf4 + tf5 + tf9 + tf10 + tf11 + tf12 + tf14 + tf17 + tf20 + tf22 + tf23 + tf25 + tf26 + tf28 + tf34 + tf36 + tf40 + tf44 + tf45 + tf46 + tf47 + tf48 + tf50 + tf55 + tf56 + tf57 + tf58 + tf59 + tf60 + tf61 + tf62 + tf63 + tf64 + tf65 + tf66 + tf67 + tf68 + tf69 + tf70 + tf71 + tf72 + tf73 + tf76 + tf78 + tf79 + tf81 + tf84 + tf86 + tf87 + tf104 + tf106 + tf109 + tf113 + tf115 + tf124 + tf131 + tf132 + tf144 + tf147 + tf148 + tf155 + tf164 - 1, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-784.85  -32.22    0.41   35.55  351.06

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
tf2    -11.2891     0.3411 -33.094  < 2e-16 ***
tf4     -8.0352     5.1829  -1.550 0.121079
tf5    -36.0534     5.0458  -7.145 9.37e-13 ***
tf9     32.1543     2.2267  14.440  < 2e-16 ***
tf10     7.3289     0.8070   9.082  < 2e-16 ***
tf11    22.9132     1.3487  16.989  < 2e-16 ***
tf12    48.0277     2.6471  18.143  < 2e-16 ***
tf14    26.5074     4.4937   5.899 3.74e-09 ***
tf17    42.6169     4.3065   9.896  < 2e-16 ***
tf20    44.4787     1.3334  33.356  < 2e-16 ***
tf22    24.8303    13.6538   1.819 0.068998 .
tf23    40.0567     7.7142   5.193 2.10e-07 ***
tf25   -15.2212     0.5951 -25.577  < 2e-16 ***
tf26    -8.2654     0.9197  -8.987  < 2e-16 ***
tf28    14.7222     4.9566   2.970 0.002980 **
tf34    17.7551     2.5995   6.830 8.79e-12 ***
tf36    34.2188     8.6157   3.972 7.17e-05 ***
tf40    40.6006     1.1303  35.920  < 2e-16 ***
tf44    32.9369     1.2416  26.527  < 2e-16 ***
tf45    20.7267     0.7192  28.818  < 2e-16 ***
tf46   -14.8687     2.5568  -5.815 6.17e-09 ***
tf47    15.5201     0.8502  18.255  < 2e-16 ***
tf48    18.9400     0.9590  19.750  < 2e-16 ***
tf50     8.0103     1.6218   4.939 7.93e-07 ***
tf55    28.8598     1.7874  16.147  < 2e-16 ***
tf56    27.9572     2.0982  13.325  < 2e-16 ***
tf57    19.6289     1.1059  17.750  < 2e-16 ***
tf58    25.3437     1.1923  21.257  < 2e-16 ***
tf59    26.4790     1.1996  22.073  < 2e-16 ***
tf60    10.0405     5.5401   1.812 0.069954 .
tf61    11.6743     0.3470  33.641  < 2e-16 ***
tf62    22.2839     0.6271  35.536  < 2e-16 ***
tf63    18.3672     0.8946  20.530  < 2e-16 ***
tf64    17.1777     0.5595  30.702  < 2e-16 ***
tf65    20.7165     1.1950  17.335  < 2e-16 ***
tf66    18.4476     0.7396  24.942  < 2e-16 ***
tf67    17.1867     2.1395   8.033 1.02e-15 ***
tf68    31.1712     6.9949   4.456 8.40e-06 ***
tf69    43.4894     9.0487   4.806 1.55e-06 ***
tf70    35.4438     2.0479  17.307  < 2e-16 ***
tf71    18.5955     3.2410   5.738 9.78e-09 ***
tf72    16.8983     8.4073   2.010 0.044454 *
tf73    26.0819     1.6811  15.515  < 2e-16 ***
tf76    21.9141     0.7073  30.982  < 2e-16 ***
tf78    13.9251     1.2672  10.989  < 2e-16 ***
tf79     7.7621     1.4669   5.291 1.23e-07 ***
tf81    19.4670     1.5937  12.215  < 2e-16 ***
tf84    -4.9750     2.4064  -2.067 0.038711 *
tf86   -22.2462     7.8444  -2.836 0.004575 **
tf87   -21.3490     4.7744  -4.472 7.82e-06 ***
tf104  -34.4811     4.7166  -7.311 2.79e-13 ***
tf106  -32.8638    16.7665  -1.960 0.050003 .
tf109   13.9377     2.2851   6.099 1.09e-09 ***
tf113   15.8908     0.6518  24.380  < 2e-16 ***
tf115   96.6851    11.5045   8.404  < 2e-16 ***
tf124   10.3334     0.8373  12.341  < 2e-16 ***
tf131   -8.9495     0.7989 -11.202  < 2e-16 ***
tf132   -6.6031     1.9248  -3.431 0.000604 ***
tf144  -27.6702    14.3434  -1.929 0.053732 .
tf147    8.5440     3.4518   2.475 0.013324 *
tf148    4.8169     2.8596   1.684 0.092105 .
tf155   11.6943     4.6059   2.539 0.011127 *
tf164    8.9702     1.5402   5.824 5.86e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 57.25 on 15952 degrees of freedom
Multiple R-squared: 0.836, Adjusted R-squared: 0.8353
F-statistic: 1290 on 63 and 15952 DF, p-value: < 2.2e-16
[output]

Following the instructions on the OpenBabel plugin development page [2], we deployed the melting point model plugin by creating the appropriate text file (mpC.txt) and updating the plugindefines.txt file in the data folder. We then used OpenBabel to calculate the melting points of the 3500-compound test set, obtaining an R2: 0.61, AAE: 43.6, and RMSE: 58.1, illustrated below (test set with statistics).
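For context, group-contribution descriptors in OpenBabel are registered via short text blocks in plugindefines.txt. The entry below is a hypothetical sketch of what the mpC registration might look like, assuming the field order is plugin class, descriptor ID, data file, and description; the exact layout should be checked against the AddNewDescriptor documentation [2].

```
OBGroupContrib
mpC
mpC.txt
melting point model (degrees Celsius)
```

The accompanying mpC.txt data file then pairs each SMARTS fragment with its fitted coefficient from the model above.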
Results
We successfully created an open fragment-based model that can easily be added to OpenBabel. We used R to create the linear model and OpenBabel as the fragment calculator for an open data set of melting points. The test-set statistics, R2: 0.61, AAE: 43.6, and RMSE: 58.1, can likely be improved upon by further curation of the open melting point data (or by using our less diverse but higher quality doubleplusgood dataset), by introducing new fragments, by optimizing the fragment order, and by using more sophisticated feature selection and linear modeling techniques than used here.

References
1. Clark M. Generalized Fragment-Substructure Based Property Prediction Method. J. Chem. Inf. Model. 2005, 45, 30-38.
2. OpenBabel plugin development document: http://open-babel.readthedocs.org/en/latest/WritePlugins/AddNewDescriptor.html